Data Summary Section
Observations from the Summary
The mean residual sugar level is 5.4 g/l, but one very sweet wine reaches 65.8 g/l (an extreme outlier). Mean free sulfur dioxide is 30.5 ppm; the maximum of 289 ppm is very high given that the 75th percentile is 41 ppm. The pH ranges from 2.7 to 4.0, with a mean of 3.2, so there are no basic (alkaline) wines in this dataset. Alcohol ranges from 8% for the lightest wine to 14.9% for the strongest. The minimum quality score is 3, the mean is 5.8, and the highest is 9.
Univariate Plots Section
In this section, I plot histograms for all the variables by color and show a summary to get a general sense of the dataset. To plot the histograms, I found a function that estimates an optimal bin width. I know the solution is not perfect, but after plotting each variable a couple of times with manually chosen bin widths, the automatic result was very close to the manual histograms. I believe this technique will be helpful for exploring other datasets, where variables can be plotted one by one very quickly.
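The report does not say which bin-width function was used, so the following is only a minimal sketch of one common choice, the Freedman-Diaconis rule (width = 2 * IQR / n^(1/3)). It is written in Python purely for illustration; the function names and the sample values are assumptions, not part of the original analysis.

```python
import math
import statistics


def freedman_diaconis_binwidth(values):
    """Return the Freedman-Diaconis bin width: 2 * IQR / n^(1/3)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    # Quartiles with linear interpolation over the observed range.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; this rule gives no usable width")
    return 2 * iqr / n ** (1 / 3)


def bin_count(values, width):
    """Number of bins of the given width needed to cover the data range."""
    return max(1, math.ceil((max(values) - min(values)) / width))


# Made-up alcohol-like values, for illustration only (not from the dataset).
sample = [8.0, 9.2, 9.5, 9.5, 10.1, 10.3, 10.8, 11.3, 12.0, 14.9]
width = freedman_diaconis_binwidth(sample)
print(f"bin width: {width:.3f}, bins: {bin_count(sample, width)}")
```

Because the rule is based on the interquartile range rather than the full range, a single extreme value has little influence on the chosen width, which may be why an automatic choice can land close to hand-tuned histograms.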
Quality of Wine
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.818 6.000 9.000

From the above summary and plot, the quality distribution looks approximately normal for both colors, even though the number of samples is very different for each color. According to the variable descriptions, quality is supposed to range from 1 to 10; however, there are no wines with a quality of 1, 2 or 10.
Level of Alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.30 10.49 11.30 14.90

The alcohol level distribution looks skewed. The red wine sample shows the same pattern of alcohol level distribution as the white wines. The most frequent alcohol level is around 9.5%, and the mean is 10.5%.
Level of Fixed Acidity
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.400 7.000 7.215 7.700 15.900

The fixed acidity distribution looks normal, and both wines follow a somewhat similar pattern. There are outliers at both ends, at 3.8 and 15.9.
Level of Volatile Acidity
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2300 0.2900 0.3397 0.4000 1.5800

The volatile acidity distribution looks normal for white wine, while it is much more spread out for red wine. The histogram also makes it clear that red wines generally have higher volatile acidity than white wines. From the summary, we can see an extreme outlier at 1.58.
Level of Citric Acid
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2500 0.3100 0.3186 0.3900 1.6600

The citric acid distribution has multiple peaks for red wine, and the same goes for white wine. From the summary, we can see an extreme outlier at 1.66.
Residual Sugar
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 3.000 5.443 8.100 65.800

The residual sugar distribution looks skewed. The red wine sample shows the same pattern of residual sugar distribution as the white wines. There is one very sweet wine in our sample at 65.8 g/dm^3.
Level of Chlorides
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100

The chlorides distribution looks normal for both white and red wines. However, it appears to be shifted toward higher chloride values.
Level of Free Sulfur Dioxide
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 17.00 29.00 30.53 41.00 289.00

Free Sulfur Dioxide distribution looks normal for white wine and skewed for red wine. There is an extreme outlier at 289.
Level of Total Sulfur Dioxide
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.0 77.0 118.0 115.7 156.0 440.0

Total Sulfur Dioxide distribution looks normal for white wine and skewed for red wine which is in accordance with the Free Sulfur Dioxide distribution. Here as well, we can see one outlier at 440.
Density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9923 0.9949 0.9947 0.9970 1.0390

The density distribution looks normal for red wine; for white wine it is very close to normal but slightly skewed. Density ranges between 0.9871 and 1.0390 g/cm^3.
Level of pH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.110 3.210 3.219 3.320 4.010

The pH distribution looks normal for both red and white wine. From the histogram, red wine appears to have a higher pH in general, i.e. it is less acidic.
Level of Sulphates
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4300 0.5100 0.5313 0.6000 2.0000

The sulphate distribution looks normal for both red and white wine samples. Here as well, we can see an extreme outlier at 2.0.
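The outlier labels in this section were read off the summaries and histograms. One way to make them checkable is Tukey's fences: a value more than 1.5 * IQR beyond a quartile is a mild outlier, and one more than 3 * IQR beyond is an extreme outlier. This is a sketch under that assumed convention, which the report itself does not name. The quartiles and maxima are copied from the summaries above; the function name is hypothetical.

```python
def classify_against_fences(value, q1, q3):
    """Classify a value as 'inside', 'mild' or 'extreme' using Tukey's fences."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    if value < outer_low or value > outer_high:
        return "extreme"
    if value < inner_low or value > inner_high:
        return "mild"
    return "inside"


# (1st quartile, 3rd quartile, maximum) copied from the summaries above.
summaries = {
    "fixed acidity": (6.4, 7.7, 15.9),
    "volatile acidity": (0.23, 0.40, 1.58),
    "citric acid": (0.25, 0.39, 1.66),
    "residual sugar": (1.8, 8.1, 65.8),
    "free sulfur dioxide": (17.0, 41.0, 289.0),
    "total sulfur dioxide": (77.0, 156.0, 440.0),
    "sulphates": (0.43, 0.60, 2.0),
}

for name, (q1, q3, maximum) in summaries.items():
    label = classify_against_fences(maximum, q1, q3)
    print(f"{name}: max {maximum} is {label}")
```

Under this convention every maximum listed falls beyond the outer fence. The fixed acidity minimum of 3.8 falls between the inner and outer fences, so whether it counts as extreme depends on the rule chosen.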
Univariate Analysis
What is the structure of your dataset?
For this project, I combined the red and white datasets offered in the Data Set Options document. After combining the datasets, there were initially 6497 observations, i.e. wine samples. Variable X is the sample number and is of no significance to this analysis. Each sample has been graded on quality from 1 to 10 (1 being the worst and 10 the best), though in this dataset the quality scores range from 3 to 9 for both red and white wines. Quality follows an approximately normal distribution.
What is/are the main feature(s) of interest in your dataset?
The main feature of interest is quality, and I want to investigate which features have the strongest effect on it. I would also like to see which variables are related to each other.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
I think alcohol and pH will play a significant role because both strongly affect the taste of the wine. Before starting this analysis, I would have thought of the age of the wine as another important feature, but surprisingly it is not among the given features. Whether the rest of the variables are important or not, we shall see as we investigate further.
Did you create any new variables from existing variables in the dataset?
For combining the red and white wine datasets, I created one variable color which tells whether the wine is white or red.
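The combining step is not shown in the report. As a rough sketch of the idea, each row can be tagged with its source before the two tables are stacked. The rows below are made up for illustration, and `combine_with_color` is a hypothetical helper, not the author's code.

```python
def combine_with_color(red_rows, white_rows):
    """Return one list of rows, each tagged with a 'color' field."""
    combined = []
    for color, rows in (("red", red_rows), ("white", white_rows)):
        for row in rows:
            tagged = dict(row)  # copy so the input rows are left untouched
            tagged["color"] = color
            combined.append(tagged)
    return combined


# Made-up rows for illustration only; the real data has many more columns.
red = [{"alcohol": 9.4, "quality": 5}]
white = [{"alcohol": 8.8, "quality": 6}, {"alcohol": 10.1, "quality": 6}]

wines = combine_with_color(red, white)
print(len(wines), [w["color"] for w in wines])
```

Keeping the label as an ordinary column means every later plot or summary can be split by color without going back to the two original files.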
Bivariate Plots Section
I will start by creating pair plots for all the variables except X. I will also compute and plot a correlation matrix for all the variables, which will help me find the variables most related to quality.
Pair plots for white wine

Pair plots for red wine

Correlation Matrix for red wine

Correlation Matrix for white wine

This correlation matrix is a 12x12 grid cut off along the diagonal -x = y, with each square representing the correlation coefficient between the two intersecting variables. Its color gradient runs from 1 to -1, colored from dark blue to dark red respectively, and fades to white as the correlation approaches zero. We can match the color of a square to its place on the legend to read off the approximate correlation of the variables in question.
From the correlation matrix and pair plots, we can see strong correlations in the following pairs:
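Each square of such a matrix holds one pairwise coefficient. The report does not state which coefficient was used; the sketch below assumes Pearson's r and uses small made-up columns, so the numbers it prints say nothing about the real wines. The function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Made-up columns for illustration only (not taken from the dataset).
toy = {
    "alcohol": [9.0, 10.0, 11.0, 12.0, 13.0],
    "density": [0.998, 0.996, 0.995, 0.993, 0.991],
    "quality": [5, 5, 6, 6, 7],
}

matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

The matrix is symmetric with ones on its diagonal, which is why a plot only needs to show one triangle of it.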
- Alcohol vs Density (for both red and white)
- Fixed Acidity vs Density (for red wine)
- Residual Sugar vs Density (for both red and white wine)
- Residual Sugar vs Alcohol (for white wine)
- Chlorides vs Density (for both red and white wine)
- Chlorides vs Sulphates (for red wine)
From the above plots we can observe the following:
- As alcohol increases, density decreases, and the correlation is very strong for both wines.
- For red wine, density increases with fixed acidity. For white wine, the correlation is very weak.
- As residual sugar increases, density increases, and the correlation is similar for both wines.
- For white wine, there is a strong negative correlation between alcohol and residual sugar, while for red wine the relationship is very weak.
- For chlorides and density, there is a positive correlation for both red and white wine. However, it is stronger for white than for red.
- For red wine, sulphates increase with chlorides. However, for white wine the relationship is very weak.
With respect to quality, we can observe the following from the above plots:
- For both red and white wine, there is a strong positive correlation between quality and alcohol.
- For both red and white wine, there is a negative correlation between quality and density.
- For both red and white wine, there is a negative correlation between quality and volatile acidity.
- For both red and white wine, there is a positive correlation between quality and citric acid.
Bivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
Quality didn't have the strongest associations in the list. Part of the lack of correlation might be due to the dataset being small and not well distributed. The strong correlations that exist are between chemical properties that I would expect to be highly correlated with each other: pH, fixed acidity and citric.acid are all strongly correlated, and free.sulfur.dioxide and total.sulfur.dioxide have a strong positive correlation. I don't understand why the ingredients I ended up with are the ones most related to wine quality. Alcohol makes sense, and I had expected that from the beginning, but it's strange to me that sulphates, citric acid or volatile acidity would be particularly related to quality.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
I couldn't understand the strong relationships between density and pH and between density and alcohol. I may need to explore the chemistry further to see how density is related to these variables. Other than this, I found a number of variables that differ significantly between red and white wine, such as fixed acidity, volatile acidity, residual sugar and total sulfur dioxide.
What was the strongest relationship you found?
The strongest correlation I found was between density and residual sugar at 0.84 for white wine, while for red wine it was between pH and fixed acidity at -0.68. For both white and red wine, the variable most strongly correlated with quality was alcohol, at 0.44 and 0.48 respectively.
Multivariate Plots Section
For this section, I will choose a few pairs of variables for which I found a strong correlation and compare them against quality and color.
Citric Acid and Alcohol

In these plots we can see that red wine is spread out fairly evenly, while for white wine the citric acid level is concentrated in the 0.2 - 0.4 range.
pH and Alcohol

pH and alcohol have quite similar distributions for both red and white wine. However, white wines generally start at a pH of about 2.9, while most red wines start around 3.1.
Chlorides and Sulphates

From the plot we can see that sulphates and chlorides are more spread out for white wine than for red wine.
Volatile Acidity and Alcohol

From this plot, we can see there is a strong relationship between alcohol and volatile acidity for both red and white wines.
Model for quality
From the bivariate analysis, it was clear that quality is strongly affected by alcohol, density, volatile acidity and citric acid. However, the correlation matrix showed that density and alcohol are strongly correlated, and so are volatile acidity and citric acid. So, to reduce possible multicollinearity, we should ideally pick one variable from each pair. For my model, I chose alcohol and volatile acidity, and I fit and plot the model for red and white wine separately.
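The fitting code is not shown in the report. As a hedged illustration of what a two-predictor linear model involves, here is a small ordinary least squares fit via the normal equations in plain Python. The data are synthetic, generated from known coefficients so the fit can be checked, and every name is hypothetical. A statistics package would normally do this step.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (predictors may be collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(predictor_rows, response):
    """Least-squares fit of response ~ intercept + predictors."""
    design = [[1.0] + list(row) for row in predictor_rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, response)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic (alcohol, volatile acidity) pairs; quality is generated from
# known coefficients, so this is NOT the wine data.
rows = [(9.0, 0.60), (9.5, 0.30), (10.5, 0.45), (11.0, 0.25), (12.0, 0.50), (13.0, 0.20)]
true_coefs = (3.0, 0.3, -1.5)
quality = [true_coefs[0] + true_coefs[1] * a + true_coefs[2] * v for a, v in rows]

intercept, b_alcohol, b_va = fit_ols(rows, quality)
print(round(intercept, 4), round(b_alcohol, 4), round(b_va, 4))
```

If two predictors were perfectly collinear, the matrix in the normal equations would be singular and the solver would raise an error. That is the extreme case of the multicollinearity concern that motivates keeping only one variable from each correlated pair.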
Multivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
The strong relationship between alcohol and volatile acidity for both red and white wines led me to create a linear model for predicting quality.
Were there any interesting or surprising interactions between features?
From the linear model and its coefficients, it was surprising to see that a decrease in volatile acidity and an increase in alcohol content make for a better wine for both red and white, even though a number of other factors differ between them. Another thing I noticed was that the model coefficients for red and white wine quality are very similar.
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
I made a model that predicts the quality of a wine from its volatile acidity and alcohol content. I believe this model could be improved with a larger dataset. The smaller patterns that I had to neglect could become important with larger datasets, and it would be interesting to include them in the model.
Final Plots and Summary
Plot One: Quality of Wine

Description
This is the univariate plot I chose for my final plots section. I chose it because it shows the distribution of quality for red and white wine. The quality with the highest count is 6 for white wine, while it is 5 for red wine. The distributions for both red and white wines look normal, and most wines in our dataset are rated 5 or 6. According to the description of the quality variable, it is supposed to range between 1 and 10. However, in our sample, I didn't find any wine rated with quality 1, 2 or 10. The mean quality came out to be 5.636 and 5.878 for red and white wine respectively. For both wines, the median quality was 6.
Plot Two: Density vs Alcohol

Description
For the second plot I chose a bivariate plot. I chose this plot because density and alcohol showed one of the strongest correlations among all wine parameters, and this strong correlation led us to exclude density from our linear model. For white wine, the correlation between alcohol and density is -0.78, while for red wine it is -0.49. This is also evident from the plot, in which the line of best fit has a negative slope, i.e. density decreases as alcohol increases.
Plot Three: Alcohol vs Volatile Acidity

Description
Since I used alcohol and volatile acidity for my model, it only makes sense to use them in the final plot and see the general trend of quality over volatile acidity ~ alcohol. We can see that better-quality wines have lower volatile acidity and a higher level of alcohol. We can see there is a strong relationship between alcohol and volatile acidity for both red and white wines. For red wine, the linear model coefficients for Intercept, alcohol and volatile.acidity are 3.0955, 0.3138 and -1.3836. For white wine, they are 3.0173, 0.3244 and -1.9792.
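To see what the reported coefficients imply, they can be plugged into the model equation quality = intercept + b_alcohol * alcohol + b_va * volatile.acidity. The short sketch below does that. The coefficients are the ones quoted above, while the example wine (10.5% alcohol, volatile acidity 0.5) is hypothetical and chosen only for illustration.

```python
# Coefficients as reported in the text: (intercept, alcohol, volatile.acidity).
COEFFICIENTS = {
    "red": (3.0955, 0.3138, -1.3836),
    "white": (3.0173, 0.3244, -1.9792),
}


def predict_quality(color, alcohol, volatile_acidity):
    """Predicted quality from the linear model for the given wine color."""
    intercept, b_alcohol, b_va = COEFFICIENTS[color]
    return intercept + b_alcohol * alcohol + b_va * volatile_acidity


# Hypothetical wine, not a row from the dataset.
for color in COEFFICIENTS:
    print(color, round(predict_quality(color, 10.5, 0.5), 2))
```

For both colors the alcohol coefficient is positive and the volatile acidity coefficient is negative, so raising alcohol or lowering volatile acidity raises the predicted score, in line with the trend described for the plot.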
Reflection
This analysis was challenging because these are chemical properties, and I know I did not use the most appropriate models for chemical properties. Tidy datasets are relatively easy to explore, so on this basis I chose the wine data and decided to combine the red and white wine data. Through this exploratory data analysis, I was able to identify the key factors that drive wine quality, mainly alcohol content and volatile acidity. From the plots and models, it is evident that wine quality increases with increasing alcohol content and decreasing volatile acidity. It was a very small sample of wines, especially for red wine, so I believe we are still missing many trends and other variables that might be crucial for a better model. In my opinion, quality is a subjective measure assessed by wine experts, and it depends on a number of other variables such as temperature, location, humidity and the age of the wine; creating an accurate mathematical model based on these variables would be really interesting. As a closing note, I can conclude from this analysis that wine experts want high alcohol content to get them drunk, but want low volatile acidity, holding the acidity down, for a better-quality wine.